Compression and the origins of Zipf's law of abbreviation

Authors

  • Ramon Ferrer-i-Cancho
  • Chris Bentz
  • Caio Seguin
Abstract

Languages across the world exhibit Zipf’s law of abbreviation, namely that more frequent words tend to be shorter. The generalized version of the law, an inverse relationship between the frequency of a unit and its magnitude, also holds for the behaviors of other species and the genetic code. The apparent universality of this pattern in human language and its ubiquity in other domains call for a theoretical understanding of its origins. We generalize the information theoretic concept of mean code length as a mean energetic cost function over the probability and the magnitude of the symbols of the alphabet. We show that the minimization of that cost function and a negative correlation between probability and the magnitude of symbols are intimately related.

Introduction. Zipf’s law of abbreviation, the tendency of more frequent words to be shorter [1], holds in every language for which it was tested [1–9]. A generalized version of the law, i.e. a negative correlation between the frequency of a unit of an alphabet (e.g. the repertoire of vocalization types) and its magnitude (e.g. its length, size or duration), has been found in the behavior of other species [9–13] and in the genetic code [14]. This is strong evidence for a general tendency of more frequent units to be shorter, i.e. less cost-intensive. The robustness and recurrence of this pattern call for a theoretical understanding of the mechanisms that give rise to it. Here we investigate the law in the light of the problem of compression from standard information theory [15]. The mean code length is defined as
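
The link between compression and abbreviation can be illustrated with a small sketch. It assumes the textbook information-theoretic mean code length, the sum of p_i * l_i over symbols with probability p_i and code length l_i, rather than the generalized cost function of the paper, which is not reproduced in this excerpt. The toy probabilities, the lengths, and the function names are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def mean_code_length(probs, lengths):
    """Textbook mean code length: sum of p_i * l_i over all symbols."""
    return sum(p * l for p, l in zip(probs, lengths))


def best_assignment(probs, lengths):
    """Try every way of assigning the given lengths to the symbols and
    return the assignment with the smallest mean code length."""
    return min(permutations(lengths),
               key=lambda perm: mean_code_length(probs, perm))


# Hypothetical toy alphabet: symbol probabilities in decreasing order
# and an arbitrary initial assignment of code lengths (assumed values).
probs = [0.5, 0.25, 0.15, 0.10]
lengths = [4, 1, 3, 2]

optimal = best_assignment(probs, lengths)

print("initial mean length:", mean_code_length(probs, lengths))
print("optimal assignment: ", optimal)
print("optimal mean length:", mean_code_length(probs, optimal))
```

In this toy case the exhaustive search assigns the shortest length to the most probable symbol and the longest to the least probable one, so probability and length end up negatively correlated, which is the pattern the law of abbreviation describes. Brute force over permutations is used only because the alphabet is tiny; sorting the lengths in ascending order against probabilities in descending order would give the same assignment.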

Similar resources

Compression and the origins of Zipf's law for word frequencies

Here we sketch a new derivation of Zipf’s law for word frequencies based on optimal coding. The structure of the derivation is reminiscent of Mandelbrot’s random typing model but it has multiple advantages over random typing: (1) it starts from realistic cognitive pressures (2) it does not require fine tuning of parameters and (3) it sheds light on the origins of other statistical laws of langu...


The origins of Zipf's meaning-frequency law

In his pioneering research, G. K. Zipf observed that more frequent words tend to have more meanings, and showed that the number of meanings of a word grows as the square root of its frequency. He derived this relationship from two assumptions: that words follow Zipf’s law for word frequencies (a power law dependency between frequency and rank) and Zipf’s law of meaning distribution (a power law...


Least effort and the origins of scaling in human language.

The emergence of a complex language is one of the fundamental events of human evolution, and several remarkable features suggest the presence of fundamental principles of organization. These principles seem to be common to all languages. The best known is the so-called Zipf's law, which states that the frequency of a word decays as a (universal) power law of its rank. The possible origins of th...


Compression as a Universal Principle of Animal Behavior

A key aim in biology and psychology is to identify fundamental principles underpinning the behavior of animals, including humans. Analyses of human language and the behavior of a range of non-human animal species have provided evidence for a common pattern underlying diverse behavioral phenomena: Words follow Zipf's law of brevity (the tendency of more frequently used words to be shorter), and ...


Testing the Robustness of Laws of Polysemy and Brevity Versus Frequency

The pioneering research of G. K. Zipf on the relationship between word frequency and other word features led to the formulation of various linguistic laws. Here we focus on a couple of them: the meaning-frequency law, i.e. the tendency of more frequent words to be more polysemous, and the law of abbreviation, i.e. the tendency of more frequent words to be shorter. Here we evaluate the robustnes...


Journal:
  • CoRR

Volume: abs/1504.04884

Publication year: 2015